03_space.Rmd
We used the R package CoordinateCleaner to flag potentially erroneous, suspect, or imprecise geographical coordinates based on geographic gazetteers and metadata. It includes a series of tests for identifying records assigned to country capitals, province and country centroids, urban areas, biodiversity institutions, or the GBIF headquarters. It also contains tests to flag coordinates below a given precision (e.g., 100 km), zero or equal coordinates, and duplicated records (i.e., identical taxon name and coordinates).
Note that we do not use the “seas” test to remove records in the ocean because such records were previously removed in the pre-filter step of the workflow (more details here).
Important:
The results of each data quality test are appended to the database as separate fields containing TRUE or FALSE, where TRUE indicates correct records and FALSE indicates potentially problematic or suspect records.
You can install the released version of ‘bdc’ from GitHub with:
if (!require("remotes")) install.packages("remotes")
if (!require("bdc")) remotes::install_github("brunobrr/bdc")
Creating folders to save the results
bdc::bdc_create_dir()
Read the database created in the taxonomy step of the BDC workflow. It is also possible to read any dataset containing the required fields to run the workflow (more details here).
database <-
  qs::qread("Output/Intermediate/02_taxonomy_database.qs")
Standardization of character encoding
for (i in seq_len(ncol(database))) {
  # [[i]] extracts the column as a vector, so this also works if `database` is a tibble
  if (is.character(database[[i]])) {
    Encoding(database[[i]]) <- "UTF-8"
  }
}
check_space <-
  CoordinateCleaner::clean_coordinates(
    x = database,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    species = "scientificName",
    countries = NULL, # not needed: the "countries" test is not run
    tests = c(
      "capitals", # records within 2 km of country capitals
      "centroids", # records within 1 km of country and province centroids
      "duplicates", # duplicated records
      "equal", # records with equal coordinates
      "gbif", # records within 1 degree (~111 km) of GBIF headquarters
      "institutions", # records within 100 m of zoos and herbaria
      "outliers", # outliers
      "zeros", # records with coordinates 0,0
      "urban" # records within urban areas
    ),
    capitals_rad = 2000,
    centroids_rad = 1000,
    centroids_detail = "both", # test both country and province centroids
    inst_rad = 100, # flag records within 100 m of zoos and herbaria
    outliers_method = "quantile",
    outliers_mtp = 5,
    outliers_td = 1000,
    outliers_size = 10,
    range_rad = 0,
    zeros_rad = 0.5,
    capitals_ref = NULL,
    centroids_ref = NULL,
    country_ref = NULL,
    country_refcol = "countryCode",
    inst_ref = NULL,
    range_ref = NULL,
    # seas_ref = continent_border,
    # seas_scale = 110,
    urban_ref = NULL,
    value = "spatialvalid" # results of each test are appended as separate columns
  )
#> Testing coordinate validity
#> Flagged 0 records.
#> Testing equal lat/lon
#> Flagged 0 records.
#> Testing zero coordinates
#> Flagged 1 records.
#> Testing country capitals
#> Flagged 10 records.
#> Testing country centroids
#> Flagged 10 records.
#> Testing urban areas
#> Downloading urban areas via rnaturalearth
#> OGR data source with driver: ESRI Shapefile
#> Source: "C:\Users\Bruno Ribeiro\AppData\Local\Temp\Rtmpi8z1As", layer: "ne_50m_urban_areas"
#> with 2143 features
#> It has 4 fields
#> Integer64 fields read as strings: scalerank
#> Flagged 279 records.
#> Testing geographic outliers
#> Flagged 10 records.
#> Testing GBIF headquarters, flagging records around Copenhagen
#> Flagged 0 records.
#> Testing biodiversity institutions
#> Flagged 11 records.
#> Testing duplicates
#> Flagged 97 records.
#> Flagged 390 of 6112 records, EQ = 0.06.
Identification of records with a coordinate precision below a specified number of decimal places. For example, a coordinate with 1 decimal place has a precision of 11.132 km at the equator, i.e., the scale of a large city.
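The 11.132 km figure follows from simple arithmetic. The sketch below assumes that one degree spans roughly 111.32 km at the equator (the value implied by that figure). The variable names are illustrative and not part of bdc.

```r
# Assumption: one degree is about 111.32 km at the equator
km_per_degree <- 111.32

# Number of decimal places retained in a coordinate
ndec <- 0:5

# Each extra decimal place divides the coordinate uncertainty by 10
precision_km <- km_per_degree / 10^ndec

precision_table <- data.frame(ndec = ndec, precision_km = precision_km)
print(precision_table)
```

Away from the equator, a degree of longitude covers less ground (it scales with the cosine of latitude), so for longitude these values are upper bounds.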
check_space <-
  bdc_coordinates_precision(
    data = check_space,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    ndec = c(0, 1) # number of decimals to be tested
  )
#> bdc_coordinates_precision:
#> Flagged 50 records
#> One column was added to the database.
It is possible to map a column containing the results of one spatial test. For example, let’s map records located at country or province centroids.
check_space %>%
  dplyr::filter(.cen == FALSE) %>%
  bdc_quickmap(
    data = .,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    col_to_map = ".cen",
    size = 0.7
  )
Creating a column named “.summary” that summarizes the results of all tests. This column is FALSE for any record flagged as FALSE (i.e., potentially invalid or suspect) by at least one test.
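Conceptually, such a summary is a row-wise logical AND over the test columns. The sketch below uses toy data and assumes that flag columns are those whose names start with “.”. It is an illustration only, not the implementation of bdc_summary_col(), and the toy values are hypothetical.

```r
# Toy flags: TRUE = record passed the test, FALSE = record was flagged
flags <- data.frame(
  id   = 1:4,
  .cen = c(TRUE, FALSE, TRUE, TRUE),
  .urb = c(TRUE, TRUE, FALSE, TRUE),
  .dpl = c(TRUE, TRUE, TRUE, TRUE)
)

# Assumption: every column whose name starts with "." holds a test result
test_cols <- grep("^\\.", names(flags), value = TRUE)

# A record is valid overall only if it passed every test
flags$.summary <- Reduce(`&`, flags[test_cols])

# Number of records flagged by at least one test
n_flagged <- sum(!flags$.summary)
print(flags)
```

In the workflow itself, the summary column is created with bdc_summary_col():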
check_space <- bdc_summary_col(data = check_space)
#> Column '.summary' already exist. It will be updated
#>
#> bdc_summary_col:
#> Flagged 467 records.
#> One column was added to the database.
Creating a report summarizing the results of all tests.
report <-
  bdc_create_report(
    data = check_space,
    database_id = "database_id",
    workflow_step = "space"
  )
#>
#> bdc_create_report:
#> Check the report summarizing the results of the space in:
#> Output/Report
report
Creating figures (bar plots and maps) to facilitate the interpretation of the results of data quality tests.
bdc_create_figures(
  data = check_space,
  database_id = "database_id",
  workflow_step = "space"
)
#> Check figures in C:/Users/Bruno Ribeiro/Documents/bdc/vignettes/Output/Figures
Rounded (potentially imprecise) coordinates

Records within biodiversity institutions

Summary of all tests
It is possible to remove flagged records (potentially problematic ones) to get a ‘clean’ database (i.e., without test columns starting with “.”). However, to ensure that all records are evaluated in all the data quality tests (i.e., tests of the taxonomic, spatial, and temporal steps of the workflow), potentially erroneous or suspect records are removed only in the final step of the workflow.
# output <-
#   check_space %>%
#   dplyr::filter(.summary == TRUE) %>%
#   bdc_filter_out_flags(data = ., col_to_remove = "all")
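For illustration, the filtering step described above can be sketched in base R on toy data. This is only an approximation of the commented code. It assumes that a ‘clean’ database keeps the records whose .summary is TRUE and drops every column whose name starts with “.”. The toy data frame and its values are hypothetical.

```r
# Toy database with two test columns and a summary column
toy <- data.frame(
  scientificName   = c("Species a", "Species b", "Species c"),
  decimalLatitude  = c(-10.5, 0, -23.1),
  decimalLongitude = c(-50.2, 0, -46.6),
  .zer             = c(TRUE, FALSE, TRUE),
  .cen             = c(TRUE, TRUE, TRUE),
  .summary         = c(TRUE, FALSE, TRUE)
)

# Keep records that passed all tests...
keep_rows <- toy$.summary

# ...and drop the test columns (assumption: names starting with ".")
keep_cols <- !startsWith(names(toy), ".")

clean <- toy[keep_rows, keep_cols, drop = FALSE]
print(clean)
```

Here the record at coordinates 0,0 is dropped and only the three original fields remain.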